Skip to content

Add: initial prototype of Markdown in Doenet#329

Open
bjones1 wants to merge 1 commit intoDoenet:mainfrom
bjones1:dast-really-markdown
Open

Add: initial prototype of Markdown in Doenet#329
bjones1 wants to merge 1 commit intoDoenet:mainfrom
bjones1:dast-really-markdown

Conversation

@bjones1
Copy link
Copy Markdown

@bjones1 bjones1 commented May 5, 2025

Several tests fail; a quick inspection shows failures related to treating text as Markdown, so I'm ignoring these failures until we decide to move in this direction.

The test I added in dast-basic.test.ts shows that text is correctly parsed as Markdown by normalizeDocumentDast. However, the test-viewer application doesn't render the resulting DAST, even though the resulting DAST tree I print from normalizeDocumentDast seems correct when comparing <p><em>testing</em></p> and its Markdown equivalent, *testing*. Any ideas on why this DAST doesn't render?

The resulting DAST:
image

This renders as:
image

@dqnykamp
Copy link
Copy Markdown
Member

dqnykamp commented May 5, 2025

For the test-viewer, first you want to change the default for usePrototype on line 32 of testViewer.tsx to true so you are always using the new DoenetML core.

Then, the problem is that the prototype doesn't yet have many components implemented. It doesn't look like there is an <em> component yet. And, unfortunately, Doenet is using the strange <alert> tag from PreTeXt rather than <strong>. (I don't think there is a <strong> in PreTeXt, but we could always deviate.)

Obvious, we don't want this behavior, but you could turn your markdown into one of the existing components to test it for now, such <m> or <p>. You can see the small list of components in the new DoenetML in packages/doenetml-worker/src/components/doenet (for ones that are processed by core) or packages/doenetml-prototype/src/renderers/doenet (the React renderers that also catch component types ignored by core).

Or, I just made a PR to your branch adding an <em> renderer using the simple strategy where Doenet core just ignores it for now. With that, "*test*" now renders as italics.

@dqnykamp
Copy link
Copy Markdown
Member

dqnykamp commented May 5, 2025

Also, for easier testing, you could use the test viewer inside the doenetml-prototype package. (That way you don't have to rebuild doenetml-prototype, only its dependent packages, to see changes.) Just use "npm run dev" in packages/doenetml-prototype and edit the testCode.doenet in doenetml-prototype/dev.

The current doenetml-prototype/dev/testCode.doenet seems to have been saved with DoenetML that currently doesn't work in prototype, so you'll have to remove what is in there to get rid of that error.

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 5, 2025

@dqnykamp, fantastic, thanks! That's very helpful -- know where to look for implemented tags, and how to run the viewer makes all the difference. With that, it looks like this is up and running. I noticed that ol and li tags, so this works nicely:

image

What's the next step? That is, given we now have the ability to render text as Markdown in the prototype Doenet viewer, does this make sense as a future Doenet feature? @siefkenj, any thoughts on this?

if (node.type === "text" && visitInfo.parents.length === 1) {
const html = commonMarkRenderer.render(
commonMarkParser.parse(node.value),
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting strategy. It's certainly quick for now, but I think it would be better to convert the AST directly so we have full control over the nodes created (e.g, in code blocks, etc.)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that it's possible we'll be using Markdown not just with text nodes, but in other places (inside code blocks, for example)? If so, then your comment below makes a lot of sense -- define where Markdown belongs and where it doesn't.

@siefkenj
Copy link
Copy Markdown
Contributor

siefkenj commented May 5, 2025

I think the next step is to set up some tests so we can all agree on what should be converted to what :-)

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 5, 2025

I think the next step is to set up some tests so we can all agree on what should be converted to what :-)

Before that, perhaps a bit of discussion? I assume the question is: which Doenet compnents' content should be rendered using Markdown? Which should not? For example, the content of an m component should not be rendered, while the content of a p component should be rendered. Once we agree on the to-render component list, creating tests is fairly straightforward.

Looking at the Doenet component types, any text inside paragraph markup components (alert, aslist, attr, etc.) should be rendered using Markdown. The text inside the following sectional components should be Markdown rendered: activity, aside, caption, cell (or should we instead use markdown tables?), conclusion, definition, example, exercise, hint, introduction, li, note, objectives, ol p, problem, proof, question, section, solution, statement, theorem, title, topic, ul. No other components will be Markdown rendered.

So, given text in the DAST, if the parent of the text is one of the markdown-enabled components, then that text will be Markdown rendered; otherwise, it won't.

Thoughts?

@siefkenj
Copy link
Copy Markdown
Contributor

siefkenj commented May 7, 2025

Didn't we discuss having markdown only work in the top-level components to start?

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 7, 2025

Didn't we discuss having markdown only work in the top-level components to start?

That was the initial discussion and certainly the simplest approach, and (AFAIK) what this PR implements. However, I understood the following to imply this PR should cover more cases (see also my comment):

Very interesting strategy. It's certainly quick for now, but I think it would be better to convert the AST directly so we have full control over the nodes created (e.g, in code blocks, etc.)

So, two questions:

  1. What's next? A series of test cases for text only in the root node? Or something else?

  2. How should we handle Doenet code like this?

    <section>
      Our goal this semester is to learn about encrpytion and decryption...
      
      Less common terms for these are encipherment and decipherment....
    
      An algorithm is a process or set of rules... 
    </section>
    

    IMHO, this text inside the section component should render as Markdown.

@siefkenj
Copy link
Copy Markdown
Contributor

siefkenj commented May 7, 2025

  1. Yes, I think that's what's next. Once we have it working for root notes, extending the rules to work for other nodes should be very easy.

  2. I'll let @dqnykamp comment on this. It's important to know what rules apply when, but it seems starting in any division would be a good place to enable the syntax.

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 7, 2025

  1. Yes, I think that's what's next. Once we have it working for root notes, extending the rules to work for other nodes should be very easy.

Great! As you know, there's one simple test case already:

    it("converts Markdown text to HTML", () => {
        let source: string;
        let dast: ReturnType<typeof lezerToDast>;

        source = `*hi*\n\n# there!`;
        dast = lezerToDast(source);
        expect(toXml(normalizeDocumentDast(dast))).toEqual(
            "<document><p><em>hi</em></p><h1>there!</h1></document>",
        );
    });

I'm assuming more tests would cover Markdown located in some other place in Doenet, but not cover the functioning of Markdown itself. Given that, what other values for source above would be useful to test? AFAIK, the only text in the root node is text outside any other tag, and IMHO this test covers that case. Perhaps source = `Is *markdown*\n<section>\nIs not *markdown*\n</section>\nIs more *markdown*`;?

@siefkenj
Copy link
Copy Markdown
Contributor

siefkenj commented May 8, 2025

Every markdown feature should be tested, ideally in separate tests, and a few that blend some features together.

So: lists, bold, em, links, images, paragraphs (other markdown features I am forgetting?)

I would imagine tests like, "markdown: paragraphs get converted to multiple <p> tags", "markdown: nested bold and emph tags get converted to nested tags", etc.

Then there should be some tests about interleaving formats. So, if a user does Pay _attention to <math>5</math>_ in the following expression, does it emph or does it not? The behaviour should be codified in a test, even if the test is marked as .skip initially because it is not implemented.

One key thing for the tests to nail down is that: markdown cannot produce invalid DoenetML. If there is a feature of markdown that doesn't exist in doenet, it should be converted to its nearest equivalent.

@dqnykamp
Copy link
Copy Markdown
Member

dqnykamp commented May 8, 2025

I think it wouldn't be hard to come up with a small list of component types inside which we'd still apply markdown, such as <section> (and similar), <cell>, <feedback>, <footnote>, <hint>. Do we have different rules for inline tags like <p> and <text>, which shouldn't have paragraphs inside them?

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 9, 2025

I'm working on writing tests for every Markdown component. Following the spec, the first item (section 4.1) is a thematic break, aka <hr/>. For example, the text

Testing

---

More testing

becomes

Testing


More testing

Doenet doesn't support this, however. What should happen in this case?

  1. Would doenet consider supporting it? This seems like a handy way to delineate sections, break up a long explanation, etc.
  2. Pass it through to Doenet, which produces an error.
  3. Refuse to parse a thematic break, leaving it as-in in the text.

I'd prefer choice 1, followed by choice 3. It's likely I can walk the parse tree from Markdown to de-transform this.

Other items that don't translate:

  • Line breaks (<br/>) -- I don't see any equivalent. Refuse to translate? Add to Doenet?

Translation notes:

HTML DoenetML
<h1> through <h6> <section> and <title>
<pre><code> <pre>
<blockquote> <q>
<code> <c>
<a href=> <ref uri=>
<img> <image>

@dqnykamp
Copy link
Copy Markdown
Member

dqnykamp commented May 9, 2025

I've frequently been asked for an <hr/>. I didn't add it because it wasn't in PreTeXt, and I didn't think of it as being semantic. I figured we'd eventually style sections to give someone the option of having a horizontal rule. But, maybe, I should just break down and add it. It's better than folks kludging it with a fake table and a single border!

* Line breaks (`<br/>`) -- I don't see any equivalent. Refuse to translate? Add to Doenet?

I'm not certain about line breaks. What's the use case where paragraphs wouldn't work?

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 9, 2025

I've frequently been asked for an <hr/>. I didn't add it because it wasn't in PreTeXt, and I didn't think of it as being semantic. I figured we'd eventually style sections to give someone the option of having a horizontal rule. But, maybe, I should just break down and add it. It's better than folks kludging it with a fake table and a single border!

Sounds good. For now, I'll just pass the <hr/> through.

* Line breaks (`<br/>`) -- I don't see any equivalent. Refuse to translate? Add to Doenet?

I'm not certain about line breaks. What's the use case where paragraphs wouldn't work?

Here's one place I think it's helpful. Paragraphs in the default style produce two newlines, whereas at line break produces just one.

$\int_{a}^{b} x^2$ is a definite integral where
$a$ is the lower limit,
$b$ is the upper limit, and
$x^2$ is the integrand.

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 9, 2025

I'm making good progress; I've just pushed updated code with a number of tests.

I ran into one minor problem and one major problem:

  1. Major: the DoenetML parser turns text like Given <m>x</m> is... into two text nodes with a math node in the middle. This is bad: Markdown parses the text nodes without knowing there's a math node in the middle of the sentence that the text should be wrapped around. I think our mistake is to use the Doenet parser first; instead, we need to run the Markdown parser, then run that result through the Doenet parser.
  2. Minor problem: Markdown allows autolinks: <http://foo> becomes http://foo. DoenetML interprets this as invalid, so Markdown never gets to look at it.

Thoughts?

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 12, 2025

More thoughts on this: parsing with Markdown first is a difficult and a bad idea. However, the current parsing idea need to change. A sketch:

  1. While walking the parse tree, if a text node's parent is a Markdown-compatible component, then:
  2. Identify the nearest previous sibling which is a block (not an inline) component, and the nearest next sibling which is a block component.
  3. Convert nodes between (but not including) these siblings into XML: if a node is not Markdown-compatible, ensure it's separated by newlines and contains no newlines, so that Markdown will leave it untouched.
  4. Parse with Markdown, convert to DoenetML, then re-parse as Doenet ML.
  5. Replace these nodes with the resulting parse output.

For example, consider the following DoenetML:

<p>This is a block component. This component does not contain Markdown.</p>

<m>x^2</m> is some math in a paragraph that **should** be interpreted as Markdown. 

<choiceInput> (a block component, doesn't contain Markdown)
  <choice>**Agree**</choice>  (These can contain Markdown)
  <choice>Disagree</choice>
</choiceInput>

We have something like this for the parse tree:

Doenet parser drawio

In the diagram, the dotted boxes indicate areas containing Markdown text to be processed as described above.

@siefkenj
Copy link
Copy Markdown
Contributor

2. Minor problem: Markdown allows autolinks: `<http://foo>` becomes http://foo. DoenetML interprets this as invalid, so Markdown never gets to look at it.

I think we just count that syntax as invalid. Markdown also interprets bare links as links, so there is no need for the <...> syntax.

visit(tree, (node, visitInfo) => {
// All text nodes that are children of the root node are interpreted as Markdown.
if (node.type === "text" && visitInfo.parents.length === 1) {
const parse_tree = commonMarkParser.parse(node.value);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to work directly with parse_tree here instead of turning it into a string and reparsing it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to look through this code. I appreciate the feedback!

I agree that transforming directly from the CommonMark parse tree to the DAST parse tree would be more elegant. However, I suggest postponing optimizations (such as this) for later, when we're confident this will be adopted. My thought process:

Current approach:

  • Pro: currently works.
  • Con: inefficient -- the parse tree is converted to text, then parsed as DAST, then CommonMark nodes are transformed to DAST.

This suggested approach:

  • Pro: more efficient, since it skips parsing HTML to DAST.
  • Con: requires more code, since items that don't currently need to be translated between HTML and DAST now need code added. Also need code to convert between the CommonMark parse tree and the DAST format.

Notes on this approach: the usage section of the CommonMark docs shows a handy walker function to walk the CommonMark parse tree, and the docs seems pretty good. Is there similar docs on the DAST structure, which would make this easier?

My conclusion:
In this end, this is an optimization, not a change in functionality. Therefore, I'd suggest this as work for later, when we're more confident that this is the right approach.

@siefkenj
Copy link
Copy Markdown
Contributor

Let's focus on the simple situation for now where there is no mixing of XML and Markdown. Handling situations like This _important <m>x</m>_ can wait until we figure out if this is a feature people really need.

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented May 15, 2025

Let's focus on the simple situation for now where there is no mixing of XML and Markdown. Handling situations like This _important <m>x</m>_ can wait until we figure out if this is a feature people really need.

Sounds good. To the best of my knowledge, the current test suit evaluates all Markdown features and either translates them to DoenetML or produces known errors (Markdown autolinks aren't supported; mixed XML and Markdown isn't supported).

The next step AFAIK is to get feedback: is this a feature people need? Therefore, what's the best approach to gather some feedback?

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented Aug 9, 2025

Thoughts on the approach to solving this:

  1. Walk the DAST tree. Classify each node as either not markdown, inline markdown, or block markdown. Combine all inline/block text into a single, large string with separators. Note that the classification of each node should be saved when processing children then restored afterward.
  2. Convert these strings to HTML using Commonmark. Split the string into fragments to apply to each inline/block markdown node. For inline markdown nodes, remove the leading <p> / trailing </p> tags.
  3. Convert each fragment to DAST. Replace the text node with the resulting DAST.

Notes:

  • The separators used should ensure that each inline/block string starts clean (not inside another block, or a fenced code block, etc.)
  • Need a way to match an fragment index (produced by splitting the HTML post Markdown conversion) to a node of the DAST.

@bjones1 bjones1 force-pushed the dast-really-markdown branch from 9cfe50a to 535d930 Compare August 13, 2025 15:26
@bjones1
Copy link
Copy Markdown
Author

bjones1 commented Aug 13, 2025

I worked on this over the weekend, implementing the ideas above. They don't work:

Markdown shouldn't simply replace existing textual content, it should also create new top-level hierarchy, moving nodes as necessary into that hierarchy. For example, the paragraph

This <m>x^2</m> is...

should become

<p>
This <m>x^2</m> is...
</p>

meaning we move the existing <m> tag inside a newly-created <p> tag, not simply add children to existing tags.

I think the best approach is to run the Markdown parser first, then run the DAST parser/tree builder, so we can create this hierarchy before it's transformed into DAST. My sketch:

  1. Do a simple XML parse of the text. Identify each tag's contents as inline markdown, block markdown, inline plain, or block plain. (I think you have a lezer-based XML-tolerant parser?)
  2. Render the parsed results back to text, creating a document containing Markdown text or escaped plain text.
  3. Parse with Markdown.
  4. Render the parsed Markdown to Doenet XML (<code> becomes <c>, <blockquote> becomes <q>, etc.) instead of the default HTML Markdown outputs.
  5. Pass to the DAST parser to a final pass.

Thoughts?

@siefkenj
Copy link
Copy Markdown
Contributor

There is a replaceNode function that should allow you to replace any node with a single node or a sequence of nodes. So if you have

{type: "text", value: `Two Markdown

Paragraphs`}

You should be able to do a replaceNode and return an array of the two paragraphs.

If you focus just on blocks of text, does the approach work? (e.g., excluding your example that has <m>x^2</m> embedded)

@bjones1
Copy link
Copy Markdown
Author

bjones1 commented Aug 13, 2025

Pure blocks of text without XML work.

Getting Markdown to work inside XML block tags (like <p>**Strong** <m>x^2</m></p>) should be possible. It will take a bit more refinement.

Getting block Markdown (blockquote, paragraph, code blocks) which contains XML tags is hard, and probably needs the process I sketched out earlier.

@siefkenj
Copy link
Copy Markdown
Contributor

Yes, markdown with xml in it will be harder...
How about this:

  1. Parse to DAST as usual
  2. If a node's content should allow parsing as markdown, concatenate the text children but for any non-text DAST put in <TEMPORARY-TAG />.
  3. Parse with remark (from unified JS). This should parse <TEMPORARY-TAG /> as a type: "html" node.
  4. Convert the markdown AST back to DAST, and whenever you encounter <TEMPORARY-TAG /> put back the DAST node that should be in that location.

There is a remark playground here, so you can get a feel for their AST: https://remark.js.org/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants